Variability in data streams
Abstract

We consider the problem of tracking with small relative error an integer function f(n) defined by
a distributed update stream f'(n). Existing streaming algorithms with worst-case guarantees for this
problem assume f(n) to be monotone; there are very large lower bounds on the space requirements for
summarizing a distributed non-monotonic stream, often linear in the size n of the stream.

Input streams that give rise to large space requirements are highly variable, making relatively large 
jumps from one timestep to the next. However, streams often vary slowly in practice. What has heretofore 
been lacking is a framework for non-monotonic streams that admits algorithms whose worst-case performance
is as good as existing algorithms for monotone streams and degrades gracefully for non-monotonic
streams as those streams vary more quickly. 

In this paper we propose such a framework. We introduce a new stream parameter, the “variability” 
V, deriving its definition in a way that shows it to be a natural parameter to consider for non-monotonic 
streams. It is also a useful parameter. From a theoretical perspective, we can adapt existing algorithms 
for monotone streams to work for non-monotonic streams, with only minor modifications, in such a way 
that they reduce to the monotone case when the stream happens to be monotone, and in such a way that 
we can refine the worst-case communication bounds from Θ(n) to Õ(v). From a practical perspective,
we demonstrate that v can be small in practice by proving that v is O(log f(n)) for monotone streams
and o(n) for streams that are “nearly” monotone or that are generated by random walks. We expect v
to be o(n) for many other interesting input classes as well.


1 Introduction 

In the distributed monitoring model, there is a single central monitor and several (k) observers. The observers
receive data and communicate with the monitor, and the goal is to maintain at the monitor a summary of 
the data received at the observers while minimizing the communication between them. 

This model was introduced by Cormode, Muthukrishnan, and Yi with the motivating application
of minimizing radio energy usage in sensor networks, but can be applied to other distributed applications 
like determining network traffic patterns. Since the monitor can retain all messages received, algorithms in 
the model can be used to answer historical queries too, making the model useful for auditing changes to and 
verifying the integrity of time-varying datasets. 

The distributed monitoring model has also yielded several theoretical results. These include algorithms 
and lower bounds for tracking total count, frequency moments, item frequencies,
quantiles, and entropy to small relative error.

However, nearly all of the upper bounds assumed that data is only inserted and never deleted. This is 
unfortunate because in the standard turnstile streaming model, all of these problems have similar algorithms 
that permit both insertions and deletions. In general, this unfortunate situation is unavoidable; existing 
lower bounds for the distributed model demonstrate that it is not possible to track even the total item
count in small space when data is permitted to be deleted. 

That said, when restrictions are placed on the types of allowable input, the lower bounds evaporate, and 
very nice upper bounds exist. Tao, Yi, Sheng, Pei, and Li [2] developed algorithms for the problem of 
summarizing the order statistics history of a dataset D over an insertion/deletion stream of size n, which
has an Ω(n)-bit lower bound in general; however, they performed an interesting analysis that yielded online
and offline upper bounds proportional to $\sum_{t} 1/|D(t)|$, with a nearly matching lower bound. A year or two
later, Liu, Radunovic, and Vojnovic considered the problem of tracking |D| under random inputs;
for general inputs, there is an Ω(n)-bit lower bound, but Liu et al. obtained (among other results) expected
communication costs proportional to √n log n when the insertion/deletion pattern is the result of fair coin
flips.

In fact, the pessimistic lower bounds for the general case can occur only when the input stream is such 
that the quantity being tracked is forced to vary quickly. In the problems considered by Tao et al. and Liu
et al., this occurs when |D| is usually small. These two groups avoid this problem in two different ways:
Tao et al. provide an analysis that yields a worst-case upper bound that is small when |D| is usually large,
and Liu et al. consider input classes for which |D| is usually large in expectation.

Our contributions In this paper we propose a framework that extends the analysis of Tao et al. to the
distributed monitoring model and that permits worst-case analysis that can be specialized for the random input
classes considered by Liu et al. In so doing, we explain the intuition behind the factor of 1/|D(t)| in
the bounds of Tao et al. and how we can separate the different sources of randomness that appear in the
algorithms of Liu et al. to obtain worst-case bounds for the random input classes we also consider.

In the next section we derive a stream parameter, the variability v. We prove that v is O(log f(n)) for
monotone streams and o(n) for streams that are “nearly” monotone or that are generated by random walks,
and find that the bounds of Tao et al. and Liu et al. are stated nicely in terms of v. In section 3 we
combine ideas from the upper bounds of Tao et al. with the existing distributed counting algorithms of
Cormode et al. and Huang, Yi, and Zhang [5] to obtain upper bounds for distributed counting that
are proportional to v. In section 4 we show that our dependence on v is essentially necessary by developing
deterministic and randomized space+communication lower bounds that hold even when v is small. We round
out the piece in section 5 with a discussion of the suitability of variability as a general framework, in which
we extend the ideas of section 3 to the problems of distributed tracking of item frequencies and of tracking
general aggregates when k = 1.

But before we jump into the derivation of variability, we define our problem formally and abstract away 
unessential details. 

Problem definition The problem is that of tracking at the coordinator an integer function f(n) defined
by an update stream f'(n) that arrives online at the sites. Time occurs in discrete steps; to be definite, the
first timestep is 1, and we define f(0) = 0 unless stated otherwise. At each new current time n the value
f'(n) = f(n) − f(n−1) appears at a single site i(n).

There is an error parameter ε that is specified at the start. The requirement is that, after each timestep
n, the coordinator must have an estimate f̂(n) for f(n) that is usually good. In particular, for deterministic
algorithms we require that ∀n, |f(n) − f̂(n)| ≤ εf(n), and for randomized algorithms we require that
∀n, P(|f(n) − f̂(n)| ≤ εf(n)) ≥ 2/3.

2 Variability 

In the original distributed monitoring paper [4], Cormode et al. define a general thresholded problem
(k, f, τ, ε). A dataset D arrives as a distributed stream across k sites. At any given point in time, the
coordinator should be able to determine whether f(D) ≥ τ or f(D) ≤ (1 − ε)τ.

In continuous tracking problems, there is no single threshold, and so f(n) is tracked to within an additive
ετ(n), where τ(n) also changes with the dataset D(n). Since τ is now a function, it needs to be defined;
the usual choice is f itself, except for tracking item frequencies and order statistics, for which (following the
standard streaming model) τ is chosen to be |D|. That is, the continuous monitoring problem (k, f, ε) is, at
all times n, to maintain at the coordinator an estimate f̂(n) of f(n) so that |f(n) − f̂(n)| ≤ εf(n).
The motivation for the way we define variability is seen more easily if we first look at the situation as
though item arrivals and communication occur continuously. That is, over n = [0.1, 0.2] we receive the
second tenth of the first item, for example. At any time t at which f changes by ±εf, we would need to
communicate at least one message to keep the coordinator in sync; so if f changes by f'(t) dt then we should
communicate |f'(t) dt / (εf(t))| messages.

With discrete arrivals, dt = 1, and we define f'(t) = f(t) − f(t−1). Otherwise, the idea remains the same,
so we would expect the total number of messages to look like $\sum_{t=1}^{n} |f'(t) / (\varepsilon f(t))|$, where here f'(t) = f(t) − f(t−1).

In sections 3 and 4 we find that, modulo the number k of sites and constant factors, this is indeed the case.

Because ε is a parameter of the problem rather than of the stream, we can move the 1/ε factor out of our
definition of variability and bring it back in along with the appropriate functions of k when we state upper
and lower bounds for our problem. This permits us to treat the stream parameter v independently of the
problem. We also need to handle the case f = 0 specially, which we can do by communicating at each
timestep that case occurs. This means we can define |f'(t)/f(t)| = 1 when f(t) = 0.

Taking all of these considerations into account, we define the f-variability of a stream to be v(n) =
$\sum_{t=1}^{n} \min\{1, |f'(t)/f(t)|\}$. We also write v'(t) = min{1, |f'(t)/f(t)|} to be the increase in variability at time t. We say
“variability” for f-variability in the remainder of this paper.
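To make the definition concrete, the following is a minimal Python sketch that computes the running variability of an update stream. The function name `variability_prefix` and the parameter `f0` are illustrative choices, not notation from the paper; the sketch assumes integer updates and applies the convention that the ratio counts as 1 whenever f(t) = 0.

```python
from typing import Iterable, List


def variability_prefix(updates: Iterable[int], f0: int = 0) -> List[float]:
    """Return [v(1), ..., v(n)] for updates f'(1..n), starting from f(0) = f0."""
    f = f0
    v = 0.0
    prefix = []
    for delta in updates:
        f += delta  # f(t) = f(t-1) + f'(t)
        # Convention from the text: the ratio counts as 1 when f(t) = 0.
        step = 1.0 if f == 0 else min(1.0, abs(delta) / abs(f))
        v += step
        prefix.append(v)
    return prefix


monotone = variability_prefix([1, 1, 1, 1])       # f = 1, 2, 3, 4
oscillating = variability_prefix([1, -1, 1, -1])  # f = 1, 0, 1, 0
print(monotone[-1], oscillating[-1])
```

The monotone stream accumulates 1 + 1/2 + 1/3 + 1/4, while the stream that keeps returning to zero pays the maximum of 1 at every timestep.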

From a practical perspective, we believe low variability streams to be common. In many database 
applications the database is interesting primarily because it tends to grow more than it shrinks, so it is 
common for the size of the dataset to have low variability; as more items are inserted, the rate of change 
of |D| shrinks relative to itself, and about as many deletions as insertions would be required to keep the
ratio constant. In the following subsection, we prove that monotone and nearly monotone functions have 
low variability and that random walks have low variability in expectation, lending evidence to our belief. 

From a theoretical perspective, variability is a way to analyze algorithms for ε relative error in the face
of non-monotonicity and generate provable worst-case bounds that degrade gracefully as our assumptions
about the input become increasingly pessimistic. For our counting problem, it allows us to adapt the
existing distributed counting algorithms of Cormode et al. and Huang et al. [5] with only minor
modifications, and the resulting analyses show that the dependence on k and ε remains unchanged.

2.1 Interesting cases with small variability 

We start with functions that are nearly monotone in the sense that they are eventually mostly nondecreasing. 
We make this precise in the theorem statement. 

Theorem 2.1. Let $f^-(n) = \sum_{t:\, f'(t)<0} |f'(t)|$ and $f^+(n) = \sum_{t:\, f'(t)>0} f'(t)$. If there is a monotone nondecreasing
function β(t) ≥ 1 and a constant t_0 such that for all n > t_0 we have f^−(n) ≤ β(n)f(n), then the
variability $\sum_{t=1}^{n} |f'(t)/f(t)|$ is O(β(n) log(β(n)f(n))).

The proof, which we defer to appendix A, partitions time into intervals over which f^+(t) doubles and
shows the variability in each interval to be O(β(n)). When f(n) is strictly monotone, β(n) = 1 suffices, and
the theorem reduces to the result claimed in the abstract. As we will see in section 3, our upper bounds will
simplify in the monotone case to those of Cormode et al. and Huang et al.
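As a quick sanity check of the monotone case, consider the simplest stream: every update is +1 and f(0) = 0, so f(t) = t. This specific stream is an illustrative assumption, not an example taken from the paper. The variability then collapses to a harmonic number:

```latex
% Worked example (assumption: f'(t) = +1 for every t and f(0) = 0, so f(t) = t).
v(n) = \sum_{t=1}^{n} \min\{1, |f'(t)/f(t)|\}
     = \sum_{t=1}^{n} \frac{1}{t}
     = H_n
     \le 1 + \ln n
     = O(\log f(n)).
```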

Next, we compute the variability for two random input classes considered by Liu et al. This
will permit us to decouple the randomness of their algorithms from the randomness of their inputs. This
means, for example, that even our deterministic algorithm of section 3 has o(n) cost in expectation for these
input classes. The first random input class we consider is the symmetric random walk.

Theorem 2.2. If f'(t) is a sequence of i.i.d. ±1 coin flips then the expected variability E(v(n)) =
O(√n log n).

Proof. The update sequence defines a random walk for f(t), and the expected variability is

$$\sum_{t=1}^{n} P(f(t) = 0) + \sum_{t=1}^{n} \sum_{s=1}^{t} 2 P(f(t) = s)/s$$

We use the following fact, mentioned and justified in Liu et al.

Fact 2.3. For any t ≥ 1 and any s we have P(f(t) = s) ≤ c_1/√t, where c_1 is some constant.

Together, these show the expected cost to be at most

$$c_1 \sum_{t=1}^{n} (1 + 2H_t)/\sqrt{t} \le c_2 \log(n) \sum_{t=1}^{n} 1/\sqrt{t} \le c_3 \log(n) \sqrt{n}$$

since (1 + 2H_n) ≤ (c_2/c_1) log(n) and $\sum_{t=1}^{n} 1/\sqrt{t} \le \int_{0}^{n} 1/\sqrt{t}\, dt$. □
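The bound can be explored empirically. The sketch below is a small Monte Carlo experiment, not part of the paper's analysis: it averages the variability of simulated symmetric random walks and prints it next to √n log n. The function names, the seed, and the choices n = 2000 and 50 trials are arbitrary assumptions made for illustration.

```python
import math
import random


def walk_variability(n: int, rng: random.Random) -> float:
    """Variability v(n) of one symmetric +/-1 random walk started at f(0) = 0."""
    f = 0
    v = 0.0
    for _ in range(n):
        f += rng.choice((-1, 1))
        v += 1.0 if f == 0 else 1.0 / abs(f)  # |f'(t)| = 1, so the ratio is 1/|f(t)|
    return v


def mean_walk_variability(n: int, trials: int, seed: int = 0) -> float:
    """Average v(n) over independent walks drawn from one seeded generator."""
    rng = random.Random(seed)
    return sum(walk_variability(n, rng) for _ in range(trials)) / trials


n = 2000
estimate = mean_walk_variability(n, trials=50)
print(estimate, math.sqrt(n) * math.log(n))
```

Since |f(t)| ≤ t, every walk has variability at least the harmonic number H_n, and since each timestep contributes at most 1, never more than n; the simulated average should fall well inside that range.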

The second random input class we consider is i.i.d. increments with a common drift rate of μ > 0. The
case μ < 0 is symmetric. We assume that μ is constant with respect to n. The proof is a simple application
of Chernoff bounds and is deferred to appendix B.

Theorem 2.4. If f'(t) is a sequence of i.i.d. ±1 random variables with P(f'(t) = 1) = (1 + μ)/2 then
E(v(n)) = O((log n)/μ).


Remarks We can restate the results of Liu et al. and Tao et al. in terms of variability.
For unbiased coin flips, Liu et al. obtain an algorithm that uses O((√k/ε) √n log n) messages (of
size O(log n) bits each) in expectation, and for biased coin flips with constant μ, an algorithm that uses
O((√k/(ε|μ|)) (log n)^{1+c}) messages in expectation. If we rewrite these bounds in terms of expected variability,
they become O((√k/ε) E(v(n))) and O((√k/ε) (log n)^c E(v(n))), respectively. In the next section, we obtain (when
k = O(1/ε²)) a randomized bound of O((√k/ε) v(n)). In marked contrast to the bounds of Liu et al., our bound
is a worst-case bound that is a function of v(n); if the input happens to be generated by fair coin flips,
then our expected cost happens to be O((√k/ε) √n log n).

The results of Tao et al. are for a different problem, but they can still be stated nicely in terms of the
|D|-variability v(n): for the problem of tracking the historical record of order statistics, they obtain a lower
bound of Ω((1/ε) v(n)) and offline and online upper bounds of O((1/ε) log²(1/ε) v(n)) and O((1/ε²) v(n)), respectively.
We adapt ideas from both their upper and lower bounds in sections 3 and 4.

3 Upper bounds 

In this section we develop deterministic and randomized algorithms for maintaining at the coordinator an
estimate f̂(n) for f(n) that is usually good. In particular, for deterministic algorithms we require that
∀n, |f(n) − f̂(n)| ≤ εf(n), and for randomized algorithms that ∀n, P(|f(n) − f̂(n)| ≤ εf(n)) ≥ 2/3. We
obtain deterministic and randomized upper bounds of O((k/ε) v(n)) and O((k + √k/ε) v(n)) messages, respectively.
For comparison, the analogous algorithms of Cormode et al. and Huang et al. [5] use O((k/ε) log n)
and O((k + √k/ε) log n) messages, respectively.

For our upper bounds we assume that f'(n) = ±1 always. If |f'(n)| > 1 we could simulate it with |f'(n)|
arrivals of ±1 updates with O(log max f'(n)) overhead, as shown in appendix C.

3.1 Partitioning time 

We use an idea from Tao et al. to first divide time into manageable blocks. At the end of each block
we know the values n and f(n) exactly. Within each block, we know these values only approximately. The
division into blocks is deterministic and the same for both our deterministic and randomized algorithms.
Our division ensures that the change in v(n) over each block is at least 1/5, which simplifies our analysis.

• The coordinator requests the sites’ values c_i and f_i at times n_0 = 0, n_1, n_2, ... and then broadcasts a
value r. These values will be defined momentarily.

• Each site i maintains a variable c_i that counts the number of stream updates f'(n) it received since the
last time it sent c_i to the coordinator. It also maintains f_i that counts the change in f it received since
the last broadcast n_j. Whenever c_i = [...], site i sends c_i to the coordinator. This is in addition to
replying to requests from the coordinator.

• The coordinator maintains a variable t̂. After broadcasting r, t̂ is reset to zero. Whenever site i sends
c_i, the coordinator updates t̂ = t̂ + c_i.

• The coordinator also maintains variables f̂, j, and t_j. At the first time n_j > n_{j−1} at which t̂ ≥ t_j,
the coordinator requests the c_i and f_i values, updates f̂ and r, sets t_{j+1} = [...], broadcasts r, and
increments j.

• When r is updated at the end of time n_j, it is set to r if 2^r 2k ≤ |f(n_j)| < 2^r 4k and zero if |f(n_j)| < 4k.

Thus we divide time into blocks B_0, B_1, ..., where B_j = [n_j + 1, n_{j+1}]. Algebra tells us some facts:

• [...] ≤ n_{j+1} − n_j ≤ 2^r k.

• If r = 0 then |f(n) − f(n_j)| ≤ k and |f(n)| ≤ 5k for all n in B_j.

• If r ≥ 1 then |f(n) − f(n_j)| ≤ 2^r k and 2^r k ≤ |f(n)| ≤ 2^r 5k for all n in B_j.

The total number of messages sent in block B_j is at most 5k: we have at most 2k updates from sites, k
in requests from the coordinator, k replies (one from each site), and k in the broadcast at n_{j+1}.

The change in variability v_j = v(n_{j+1}) − v(n_j) over block B_j is

$$v(n_{j+1}) - v(n_j) \ge \begin{cases} k/5k & \text{if } r = 0 \\ 2^r k / 2^r 5k & \text{if } r \ge 1 \end{cases} \ge 1/5$$

And therefore the total number of messages (all O(log n) bits in size) is bounded by 25kv + 3k.


3.2 Estimation inside blocks 

What remains is to estimate f(n) within a given block. Since we have partitioned time into constant-variability
blocks, we can use the algorithms of Cormode et al. and Huang et al. [8] almost directly.
Both of our algorithms use the following template, changing only Condition, Message, and Update:

• Site i maintains a variable d_i that tracks the drift at site i, defined as the sum of f'(n) updates received
at site i during the block. That is, f(n) − f(n_j) = Σ_i d_i.

• Site i also maintains a variable δ_i that tracks the change in d_i since the last time site i sent a Message.
δ_i is initially zero.

• The coordinator maintains an estimate d̂_i for each value d_i. These are initially zero. It also defines
two estimates based on these d̂_i:

◦ For the global drift: d̂ = Σ_i d̂_i.
◦ For f(n): f̂(n) = f(n_j) + d̂(n).

• When site i receives stream update f'(n), it updates d_i. It then checks its Condition. If true, it sends
a Message to the coordinator and resets δ_i = 0.

• When the coordinator receives a Message from a site i it updates its estimates.

3.3 The deterministic algorithm 

Our method guarantees that at all times n we have |f(n) − f̂(n)| ≤ ε|f(n)|. It uses O(kv/ε) messages in
total.

• Condition: true if |δ_i| = 1 and r = 0, or if |δ_i| ≥ ε2^r. Otherwise, false.

• Message: the new value of d_i.

• Update: set d̂_i = d_i.
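The template and the deterministic Condition, Message, and Update rules can be sketched as a single-process simulation of one block. This is a hedged illustration, not the authors' implementation: the function `run_block`, the representation of the stream as (site, ±1) pairs, and the reading of the threshold as "at least ε2^r" are assumptions, and the time-partitioning protocol that chooses r is not modeled.

```python
import random
from typing import List, Tuple


def run_block(updates: List[Tuple[int, int]], k: int, r: int, eps: float) -> Tuple[int, float]:
    """Replay (site, +/-1) updates for one block; return (messages, worst |d - d_hat|)."""
    d = [0] * k        # true drift d_i at each site
    delta = [0] * k    # change in d_i since site i last sent a message
    d_hat = [0] * k    # coordinator's copy of each d_i
    messages = 0
    worst = 0.0
    for site, step in updates:
        d[site] += step
        delta[site] += step
        # Condition: every update when r = 0, otherwise once |delta_i| reaches eps * 2^r.
        if (r == 0 and abs(delta[site]) == 1) or abs(delta[site]) >= eps * 2 ** r:
            d_hat[site] = d[site]   # Message carries d_i; Update sets d_hat_i = d_i
            delta[site] = 0
            messages += 1
        worst = max(worst, abs(sum(d) - sum(d_hat)))
    return messages, worst


rng = random.Random(1)
k, r, eps = 4, 6, 0.25
block = [(rng.randrange(k), rng.choice((-1, 1))) for _ in range(2 ** r * k)]
messages, worst = run_block(block, k, r, eps)
print(messages, worst)
```

Because every site keeps |δ_i| below ε2^r after each timestep, the coordinator's drift estimate is off by less than ε2^r k in this simulation, and a site needs at least ε2^r updates between consecutive messages.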
Let δ = Σ_i δ_i be the error with which d̂ estimates d = Σ_i d_i. The error in f̂ is

|f̂(n) − f(n)| = |(f(n_j) + d̂(n)) − (f(n_j) + d̂(n) + δ(n))| = |δ(n)|

When r ≥ 1 we have |B_j| ≤ 2^r k, and we always have that δ ≤ |B_j|. Since we constrain δ_i < ε2^r at the end
of each timestep, we have |f(n) − f̂(n)| < ε2^r k ≤ ε|f(n)|.

We also use at most 2k/ε messages for the block. If r = 0 then the number of messages is at most k. If
r ≥ 1, then since a site must receive ε2^r new stream updates to send a new message, and since there are at
most 2^r k stream updates in the block, there are at most k/ε messages.

In each block the change in v is at least 1/5, so the total number of messages is at most 5kv/ε.

3.4 The randomized algorithm 

Our method uses O(√k v/ε) messages (plus the time partitioning) and guarantees that at all times n we
have P(|f(n) − f̂(n)| > ε|f(n)|) < 1/3.

The idea is to estimate the sums d^+ and d^− separately. The estimators for those values are independent
and monotone, so we can use the method of Huang et al. [5] to estimate the two and then combine them.

Specifically, the coordinator and each site run two independent copies A^+ and A^− of the algorithm.
Whenever f'(n) = +1 arrives at site i, a +1 is fed into algorithm A^+ at site i. Whenever f'(n) = −1
arrives at site i, a +1 is fed into algorithm A^− at site i. So the drifts d_i^+ and d_i^− at every site will always
be nonnegative. At the coordinator, the estimates d̂_i^+ and d̂_i^− are tracked independently also. However, the
coordinator also defines d̂ = d̂^+ − d̂^− and f̂(n) = f(n_j) + d̂(n). The definitions for algorithm A^± are

• Condition: true with probability p = min{1, 3/(ε2^r k^{1/2})}.

• Message: the new value of d_i^±.

• Update: set d̂_i^± = d_i^± − 1 + 1/p.

The following fact 3.1 is lemma 2.1 of Huang et al. [8]. Our algorithm effectively divides the stream
f'(B_j) into two streams |f'(B_j^±)|. Since these streams consist of +1 increments only we run the algorithm
of Huang et al. separately on each of them. At any time n, stream |f'(B_j^±)| has seen d_i^±(n) increments at
site i, and lemma 2.1 of Huang et al. guarantees that the estimates d̂_i^±(n) for the counts d_i^±(n) are good.

Fact 3.1. E(d̂_i^±) = d_i^± and Var(d̂_i^±) ≤ 1/p².

This means that E(d̂^±) = Σ_i E(d̂_i^±) = d^±, and therefore that E(d̂) = Σ_i E(d̂_i^+ − d̂_i^−) = Σ_i d_i.

Since the estimators d̂_i^± are independent, the variance of the global drift is at most 2k/p². By Chebyshev’s
inequality,

$$P(|\hat{d} - d| > \varepsilon 2^r k) \le \frac{2k/p^2}{(\varepsilon 2^r k)^2} < 1/3$$

Further, the expected cost of block B_j is at most p|B_j| ≤ (3/(ε2^r k^{1/2}))(2^r 2k) ≤ 30k^{1/2} v_j/ε.
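The per-site estimator behind Fact 3.1 can be illustrated with a short simulation. The sketch below models a single site and a single monotone counter, with the estimate left at its initial zero until the first message arrives; the function name `estimate_count` and the parameter values are assumptions chosen for illustration, and the code is not taken from Huang et al.

```python
import random


def estimate_count(d: int, p: float, rng: random.Random) -> float:
    """Coordinator's estimate of a site's count after d unit increments."""
    last_reported = None
    for count in range(1, d + 1):
        if rng.random() < p:          # Condition: true with probability p
            last_reported = count     # Message: the new value of the count
    if last_reported is None:
        return 0.0                    # no message yet, estimate stays at its initial zero
    return last_reported - 1 + 1 / p  # Update: d_hat = d - 1 + 1/p


rng = random.Random(7)
d, p, trials = 200, 0.1, 4000
mean = sum(estimate_count(d, p, rng) for _ in range(trials)) / trials
print(mean)
```

Averaged over many independent runs, the estimate should land close to the true count d, which is the unbiasedness that Fact 3.1 asserts; the spread of a single run is on the order of 1/p.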

4 Lower bounds 

In this section we show that the dependence on v is essentially necessary by developing deterministic and 
randomized lower bounds on space+communication that hold even when v is small. Admittedly, this is not 
as pleasing as a pure communication lower bound would be. On the other hand, a distributed monitoring 
algorithm with high space complexity would be impractical for monitoring sensor data, network traffic 
patterns, and other applications of the model. Note that in terms of space+communication, our deterministic 
lower bound is tight up to factors of k, and our randomized lower bound is within a factor of log(n) of that.

For these lower bounds we use a slightly different problem. We call this problem the tracing problem.
The streaming model for the tracing problem is the standard turnstile streaming model with updates f'(n)
arriving online. The problem is to maintain in small space a summary of the sequence f so that, at
any current time n, if we are given an earlier time t as a query, we can return an estimate f̂(t) so that
P(|f(t) − f̂(t)| ≤ εf(t)) is large (one in the deterministic case, 2/3 in the randomized case). We call this the
tracing problem because our summary traces f through time, so that we can look up earlier values.

In appendix D we show that a space lower bound for the tracing problem implies a space+communication
lower bound for the distributed tracking problem. Here, we develop deterministic and randomized space lower 
bounds for the tracing problem. 

4.1 The deterministic bound 

The deterministic lower bound that follows is similar in spirit to the lower bound of Tao et al. It uses
a simple information-theoretic argument.

Theorem 4.1. Let ε = 1/m for some integer m > 2, let n > 2m, let c < 1 constant, and let r < n^c and
even. If a deterministic summary S(f) guarantees, even only for sequences for which v(n) = ((6m+9)/(2m+6)) εr, that
|f(t) − f̂(t)| ≤ εf(t) for all t ≤ n, then that summary must use Ω(((log n)/ε) v(n)) bits of space.

The full proof appears in appendix E. At a high level, the sequences in the family take only values m or
m + 3, and each sequence is defined by r of the n timesteps. If the new timestep t is one of the r chosen for
our sequence, then we flip from m to m + 3 or vice-versa. All of these sequences are unique and there are
2^{Ω(r log n)} of them.
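The construction is easy to reproduce. The following sketch builds one member of the family and computes its variability exactly with rational arithmetic; the helper names `flip_sequence` and `exact_variability` and the concrete values m = 4, n = 20 are illustrative assumptions. With r/2 flips in each direction, the variability should equal (r/2)(3/(m+3) + 3/m).

```python
from fractions import Fraction
from typing import List, Set


def flip_sequence(n: int, m: int, flips: Set[int]) -> List[int]:
    """f_S(1..n): start at f_S(0) = m and toggle between m and m + 3 at each t in S."""
    f = m
    seq = []
    for t in range(1, n + 1):
        if t in flips:
            f = (2 * m + 3) - f
        seq.append(f)
    return seq


def exact_variability(seq: List[int], f0: int) -> Fraction:
    """Sum of min{1, |f'(t)/f(t)|}; every value here is positive, so no f = 0 case."""
    total = Fraction(0)
    prev = f0
    for f in seq:
        total += min(Fraction(1), Fraction(abs(f - prev), f))
        prev = f
    return total


m, n, flips = 4, 20, {2, 5, 9, 14}          # epsilon = 1/m, r = 4 flips
v = exact_variability(flip_sequence(n, m, flips), m)
print(v)
```

Changing any single flip time produces a different sequence with the same variability, which is why counting the flip sets counts the family.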

4.2 The randomized bound 

We use a construction similar to the one in our deterministic lower bound to produce a randomized lower 
bound. In order to make the analysis simple we forego a single variability value for all sequences in our 
constructed family, but still maintain that they all have low variability. C is a universal constant to be 
defined later. 

Theorem 4.2. Choose ε < 1/2, v > 32400 ε ln C, and n > 3v/ε. If a summary S(f) guarantees, even only
for sequences for which v(n) ≤ v, that P(|f(t) − f̂(t)| ≤ εf(t)) ≥ 99/100 for all t ≤ n, then that summary
must use Ω(v/ε) bits of space.

We prove this theorem in two lemmas. In the first lemma, we reduce the claim to a claim about the 
existence of a hard family of sequences. In the second lemma we show the existence of such a family. 

First a couple of definitions. For any two sequences f and g define the number of overlaps between f
and g to be the number of positions 1 ≤ t ≤ n for which |f(t) − g(t)| ≤ ε max{f(t), g(t)}. Say that f and g
match if they have at least [...] overlaps.

Lemma 4.3. Let F be a family of sequences of length n and variabilities ≤ v such that no two sequences
in F match. If a summary S(f) guarantees for all f in F that P(|f(t) − f̂(t)| ≤ εf(t)) ≥ 99/100 for all
t ≤ n, then that summary must use Ω(log |F|) bits of space.

The full proof appears in appendix F. At a high level, if S(f) is the summary for a sequence f, we can
use it to generate an approximation f̂ that at least 90% of the time overlaps with f in at least [...] n positions.
Since no two sequences in F overlap in more than [...] n positions, at least 90% of the time we can determine
f given f̂. We then solve the one-way Index problem by deterministically generating F and sending a
summary S(f(x)), where x is Alice’s input string of size N = log_2 |F|, and f(x) is the xth sequence in F.
Lemma 4.4. For all ε < 1/2, v > 32400 ε ln C, and n > 3v/ε, there is a family F of size 2^{Ω(v/ε)} of sequences
of size n such that:

1. no two sequences match, and

2. every sequence has variability at most v.

The full proof appears in appendix G. At a high level, sequences again switch between m = 1/ε and
m + 3, except that these switches are chosen independently. We model the overlap with a Markov chain; the
overlap between any two sequences is the sum over times t of a function y applied to the states of a chain
modeling their interaction. We then apply a result of Chung, Lam, Liu, and Mitzenmacher [5] to show that
the probability that any two sequences match is low. Lastly, we show that not too many sequences have
variability more than v, by proving that they usually don’t switch between m and m + 3 many times.


5 Variability as a framework 

In section 2 we proposed the f-variability $\sum_{t=1}^{n} \min\{1, |f'(t)/f(t)|\}$ as a way to analyze algorithms for the continuous
monitoring problem (k, f, ε) over general update streams. However, our discussion so far has focused on
distributed counting. In this final section we revisit the suitability of our definition by mentioning extensions 
to tracking other functions of a dataset defined by a distributed update stream. We include fuller discussions 
of these extensions in the appendices. 

5.1 Tracking item frequencies 

We can extend our deterministic algorithm of section 3 to the problem of tracking item frequencies, in a
manner similar to that in which Yi and Zhang extend the ideas of Cormode et al. [3] to this
problem. The definition of this problem, the changes to our algorithm of section 3 needed to solve
it, and the difficulties in finding a randomized algorithm are discussed in an appendix.


5.2 Aggregate functions with one site 

In this subsection we consider general single-integer-valued functions f of a dataset. When there is a single
site, the site always knows the exact value of f(n), and the only issue is updating the coordinator to have
an approximation f̂(n) so that |f(n) − f̂(n)| ≤ εf(n) for all n. We can show that this problem of tracking f
to ε relative error when k = 1 has an O((1/ε) v(n))-word upper bound, where here v(n) is the f-variability. The
algorithm is: whenever |f − f̂| > εf, send f to the coordinator. The proof is a simple potential argument
and is deferred to an appendix.
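The single-site rule is short enough to simulate directly. The sketch below is illustrative only: the function name `track_single_site` is an assumption, updates are taken to be integers with f(0) = 0, and |f| is used in the threshold so the code also runs if f dips below zero. It counts the messages the rule sends alongside the stream's variability, so the two can be compared.

```python
from typing import Iterable, Tuple


def track_single_site(updates: Iterable[int], eps: float) -> Tuple[int, float]:
    """Return (messages sent, variability v(n)) for one site tracking f with f(0) = 0."""
    f = 0
    f_hat = 0          # coordinator's current estimate
    messages = 0
    v = 0.0
    for delta in updates:
        f += delta
        v += 1.0 if f == 0 else min(1.0, abs(delta) / abs(f))
        if abs(f - f_hat) > eps * abs(f):   # rule from the text: resend f when the error exceeds eps*f
            f_hat = f
            messages += 1
        assert abs(f - f_hat) <= eps * abs(f)   # the guarantee holds after every timestep
    return messages, v


messages, v = track_single_site([1] * 1000, eps=0.1)
print(messages, v)
```

On a monotone stream the message count grows roughly like (1/ε) log n, while a stream that alternates between 1 and 0 forces a message at every timestep and has variability n.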

Along with our lower bounds of section 4, this upper bound lends evidence to our claim that variability
captures the difficulty of communicating changes in f that are due to the non-monotonicity of the input
stream. A bolder claim is that variability is also useful in capturing the difficulty of the distributed computation
of a general function that is due to the non-monotonicity of the input stream, but the extent to which
that claim is true has yet to be determined.
A Variability of nearly monotone f(n), theorem 2.1

Theorem A.1. Let $f^-(n) = \sum_{t:\, f'(t)<0} |f'(t)|$ and $f^+(n) = \sum_{t:\, f'(t)>0} f'(t)$. If there is a monotone nondecreasing
function β(t) ≥ 1 and a constant t_0 such that for all n > t_0 we have f^−(n) ≤ β(n)f(n), then the
variability $\sum_{t=1}^{n} |f'(t)/f(t)|$ is O(β(n) log(β(n)f(n))).

Proof. For i = 1, ..., k define t_i to be the earliest time t such that f^+(t_i) ≥ 2f^+(t_{i−1}), where k is the
smallest index such that t_k > n. (If k is undefined, define k = n + 1.)

The cost $\sum_{t<t_0} |f'(t)/f(t)|$ is constant. We bound the cost $\sum_{t=t_0}^{n} |f'(t)/f(t)|$ as follows. We partition
the interval [t_0, t_k) into subintervals [t_0, t_1), ..., [t_{k−1}, t_k) and sum over the times t in each one. There are
at most 1 + log f^+(n) of these subintervals.

$$\sum_{t=t_0}^{n} \left|\frac{f'(t)}{f(t)}\right| \le \sum_{i=1}^{k} \sum_{t=t_{i-1}}^{t_i - 1} |f'(t)| \, \frac{1 + \beta(t)}{f^+(t)}$$
$$\le \sum_{i=1}^{k} (1 + \beta(n)) \, \frac{f^+(t_i - 1) + f^-(t_i - 1)}{f^+(t_{i-1})}$$
$$\le 4(1 + \beta(n))(1 + \log f^+(t_{k-1}))$$
$$\le 4(1 + \beta(n))(1 + \log(2(1 + \beta(n)) f(n)))$$

because the condition f^−(t) ≤ β(t)f(t) implies f(t) ≥ f^+(t)/(1 + β(t)) and f^−(t) ≤ f^+(t).

□


B Variability of biased coin flips, theorem 2.4

Theorem B.1. If f'(t) is a sequence of i.i.d. ±1 random variables with P(f'(t) = 1) = (1 + μ)/2 then
E(v(n)) = O((log n)/μ).

Proof. We show that, with high probability, f(t) ≥ μt/2 for times t ≥ t_0 = t_0(n) when n is large enough
with respect to μ.

We write f(t) = −t + 2Y_t, where $Y_t = \sum_{s=1}^{t} y_s$ and each y_s is a Bernoulli variable with mean (1 + μ)/2. We have
that P(f(t) < μt/2) = P(Y_t < ((2 + μ)/4) t) and that E(Y_t) = ((1 + μ)/2) t. Using a Chernoff bound, P(Y_t < ((2 + μ)/4) t) ≤
exp(−μt/16). Let A be the event ∃t ≥ t_0 (f(t) < μt/2). Then $P(A) \le \sum_{t=t_0}^{n} \exp(-\mu t/16)$ by the union bound.

We can upper bound this sum by

$$\sum_{t=t_0}^{\infty} \exp(-\mu t/16) = \frac{\exp(-\mu t_0/16)}{1 - \exp(-\mu/16)}$$

Taking t_0 = (16/μ) ln(17n/μ) gives us P(A) ≤ 1/n. Thus

$$E\left(\sum_{t=1}^{n} \min\left\{1, \left|\frac{f'(t)}{f(t)}\right|\right\}\right) \le t_0 + n \cdot \frac{1}{n} + \sum_{t=t_0}^{n} \frac{2}{\mu t} = O\left(\frac{\log n}{\mu}\right)$$

yielding the theorem.

□
C Simulating large |f'(n)|, section 3


We noted in section 3 that we can simulate |f'(n)| > 1 with |f'(n)| arrivals of ±1 updates with O(log max f'(n))
overhead. To simplify notation we define 1/f(n) = 1 when f(n) = 0 and assume that f(n) ≥ 0 always.


Theorem C.l. For f'{n) > 1 we have < — 1 we have 

St=o f(n)+t — ! where H{x) is the xth harmonic number. 


Proof. For f {n) > 1, we have f(n-i)+t 

If f'(n) < —1 and /(n) > 1, then X]t=o 
and if f{n) = 0, add another |/'(n)|//(n). 


_ /'(") 
/(") 
1 < 
/(")+* — 


1 f'M-t < /'(”) 

/(n) /(„-!)+( — /(n) 




/'(") y^/'(") 1 

/(n) Z^t=l t ■ 
\f'in)\ \ ^ 2 \f'(n)\ 
fin) ) - fin) > 

□ 


D Tracing and distributed tracking, section |4] 

Lemma D.l. Fix some e. Suppose that the tracing problem has an fl(L^(ri))-bit space deterministic lower 
bound. Also suppose that there is a deterministic algorithm A for the distributed tracking problem that uses 
r2(Ce(n)) bits of communication and ifl{Sg{n)) bits of space at the site and coordinator combined. Then we 
must have C + S = il{L). 

Further, if we replace “deterministic” with “randomized” in the preceding paragraph, the claim still holds. 

Proof. Suppose instead that for all constants $c < 1$ and all $n_0$ there is an $n > n_0$ such that $C(n) + S(n) < cL(n)$. Then we can write an algorithm $B$ for the tracing problem that uses $L'(n) < cL(n)$ bits of space: simulate $A$, recording all communication, and on a query $t$, play back the communication that occurred through time $t$. This contradicts the lower bound for the tracing problem.

At no point did we use the fact that $A$ guarantees $P(|\hat{f}(t) - f(t)| \le \varepsilon f(t)) = 1$, so the claim still holds if we change the correctness requirement to $P \ge 2/3$. □
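The reduction in this proof (record the tracker's messages, then replay them to answer a historical query) can be sketched as follows. This is an illustration only: the stand-in tracker used for $A$ is a simple single-site threshold rule chosen for brevity, and the names `ReplayTracer` and `trace_with_tracker` are hypothetical.

```python
import bisect


class ReplayTracer:
    """Stores the (time, value) messages a tracker sent, and answers a query
    for time t by replaying every message sent up to and including t."""

    def __init__(self):
        self.times = []
        self.values = []

    def record(self, t, estimate):
        self.times.append(t)
        self.values.append(estimate)

    def query(self, t):
        i = bisect.bisect_right(self.times, t)
        return self.values[i - 1] if i else 0


def trace_with_tracker(updates, eps):
    """Run a stand-in tracker (assumed: send f whenever |f - f_hat| > eps*|f|)
    and log all of its communication in a ReplayTracer."""
    tracer = ReplayTracer()
    f = 0
    f_hat = 0
    for t, delta in enumerate(updates, start=1):
        f += delta
        if abs(f - f_hat) > eps * abs(f):
            f_hat = f
            tracer.record(t, f_hat)
    return tracer


if __name__ == "__main__":
    tracer = trace_with_tracker([1] * 50 + [-1] * 20 + [1] * 5, eps=0.25)
    print(len(tracer.times), tracer.query(30))
```

The space used by the tracer is exactly the recorded communication, which is the point of the argument: a tracker with little communication and space would yield a small tracing summary.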


E Deterministic lower bound, theorem 4.1

Theorem E.1. Let $\varepsilon = 1/m$ for some integer $m > 2$, let $n > 2m$, let $c < 1$ constant, and let $r \le n^c$ and even. If a deterministic summary $S(f)$ guarantees, even only for sequences for which $v(n) = \frac{6m+9}{2m+6}\varepsilon r$, that $|\hat{f}(t) - f(t)| \le \varepsilon f(t)$ for all $t \le n$, then that summary must use $\Omega(\frac{\log n}{\varepsilon} v(n))$ bits of space.

Proof. We construct a family of input sequences of length $n$ and variability $\frac{6m+9}{2m+6}\varepsilon r$. Choose sets of $r$ different indices from $1 \ldots n$, so that there are choose$(n, r)$ such sets.

For each set $S$ we define an input sequence $f_S$. We define $f_S(0) = m$ and the rest of $f_S$ recursively: $f_S(t) = f_S(t-1)$ if $t$ is not in $S$, and $f_S(t) = (2m+3) - f_S(t-1)$ if $t$ is in $S$. (That is, switch between $m$ and $m+3$.)

If $A$ and $B$ are two different sets, then $f_A \ne f_B$: let $i$ be the smallest index that is in one and not the other; say $i$ is in $A$. Then $f_A(1 \ldots (i-1)) = f_B(1 \ldots (i-1))$, but $f_A(i) \ne f_A(i-1) = f_B(i-1) = f_B(i)$.

The variability of any $f_S$ is $\frac{6m+9}{2m+6}\varepsilon r$: There are $r/2$ changes from $m$ to $m+3$ and another $r/2$ from $m+3$ to $m$. When we switch from $m$ to $m+3$, we get $|f'(t)/f(t)| = 3/(m+3)$, and when we switch from $m+3$ to $m$, we get $|f'(t)/f(t)| = 3/m$. Thus $\sum_t \left|\frac{f'(t)}{f(t)}\right| = \frac{r}{2}\left(\frac{3}{m} + \frac{3}{m+3}\right) = \frac{6m+9}{2m+6}\varepsilon r$.

There are choose$(n, r) \ge (n/r)^r$ input sequences in our family, so to distinguish between any two input sequences we need at least $r \log(n/r) = \Omega(r \log n)$ bits. Any summary that can determine for each $t$ the value $f(t)$ to within $\pm\varepsilon f(t)$ must also distinguish between $f(t) = m$ and $f(t) = m+3$, since there is no value within $\varepsilon m$ of $m$ and also within $\varepsilon(m+3)$ of $m+3$. Since this summary must distinguish between $f(t) = m$ and $f(t) = m+3$ for all $t$, it must distinguish between any two input sequences in the family, and therefore needs $\Omega(r \log n)$ bits. □
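The construction in this proof is easy to instantiate on small parameters. The sketch below builds every switching sequence for a given $n$, $m$, and $r$, and computes the variability exactly; the function names are hypothetical, and $m \ge 3$ is assumed so that the $\min\{1, \cdot\}$ cap never applies.

```python
from fractions import Fraction
from itertools import combinations


def switching_sequence(n, m, switch_times):
    """f_S(0) = m; at each t in S the value flips between m and m + 3."""
    switches = set(switch_times)
    seq = [m]
    for t in range(1, n + 1):
        prev = seq[-1]
        seq.append((2 * m + 3) - prev if t in switches else prev)
    return seq


def variability(seq):
    """Sum over t >= 1 of min(1, |f(t) - f(t-1)| / f(t)), for positive f."""
    total = Fraction(0)
    for t in range(1, len(seq)):
        delta = abs(seq[t] - seq[t - 1])
        total += min(Fraction(1), Fraction(delta, seq[t]))
    return total


if __name__ == "__main__":
    n, m, r = 8, 4, 2
    family = [switching_sequence(n, m, s)
              for s in combinations(range(1, n + 1), r)]
    # Closed form from the proof: (6m + 9) / (2m + 6) * eps * r, eps = 1/m.
    closed_form = Fraction(6 * m + 9, 2 * m + 6) * Fraction(r, m)
    print(len(family), variability(family[0]), closed_form)
```

For $n = 8$, $m = 4$, $r = 2$ the family has 28 pairwise distinct sequences, each with variability $33/28$.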
F Randomized lower bound, lemma 4.3

Lemma F.1. Let $\mathcal{F}$ be a family of sequences of length $n$ and variabilities $\le v$ such that no two sequences in $\mathcal{F}$ match. If a summary $S(f)$ guarantees for all $f$ in $\mathcal{F}$ that $P(|\hat{f}(t) - f(t)| \le \varepsilon f(t)) \ge 99/100$ for all $t \le n$, then that summary must use $\Omega(\log |\mathcal{F}|)$ bits of space.

Proof. Let $S(f)$ be the summary for a sequence $f$, and sample $\hat{f}(1), \ldots, \hat{f}(n)$ once each using $S(f)$ to get $\hat{f}$. Let $A$ be the event that $|\{t : |\hat{f}(t) - f(t)| \le \varepsilon f(t)\}| \ge \frac{9}{10}n$. By Markov’s inequality and the guarantee in the premise, we must have $P(A) \ge 9/10$.

Let $\omega$ define the random bits used in constructing $S(f)$ and in sampling $\hat{f}$. For any choice $\omega$ in $A$ we have that $\hat{f}$ overlaps with $f$ in at least $\frac{9}{10}n$ positions, which means that $\hat{f}$ overlaps with any other $g \in \mathcal{F}$ in at most $\frac{7}{10}n$ positions: at most the $\frac{6}{10}n$ in which $f$ and $g$ could overlap, plus the $\frac{1}{10}n$ in which $\hat{f}$ and $f$ might not overlap.

Define $F_{\hat{f}} \subseteq \mathcal{F}$ to be the sequences $g$ that overlap with $\hat{f}$ in more than $\frac{7}{10}n$ positions. This means that when $\omega \in A$ we have $|F_{\hat{f}}| = 1$, and therefore with probability at least 9/10 we can identify which sequence $f$ had been used to construct $S(f)$.

We now prove our claim by reducing the Index$_N$ problem to the problem of tracing the history of a sequence $f$. The following statement of Index$_N$ is roughly as in Kushilevitz and Nisan [9]. There are two parties, Alice and Bob. Alice has an input string $x$ of length $N = \log_2 |\mathcal{F}|$ and Bob has an input string $i$ of length $\log_2 N$ that is interpreted as an index into $x$. Alice sends a message to Bob, and then Bob must output $x_i$ correctly with probability at least 9/10.

Consider the following algorithm for solving Index$_N$. Alice deterministically generates a family $\mathcal{F}$ of sequences of length $n$ and variabilities $\le v$ such that no two match, by iterating over all possible sequences and choosing each next one that doesn’t match any already chosen. Her $\log_2 |\mathcal{F}|$ bits of input $x$ index a sequence $f$ in $\mathcal{F}$. Alice computes a summary $S(f)$ and sends it to Bob. After receiving $S(f)$, Bob computes $\hat{f}(t)$ for every $t = 1 \ldots n$, to get a sequence $\hat{f}$. He then generates $\mathcal{F}$ himself and creates a set $F_{\hat{f}}$ of all sequences in $\mathcal{F}$ that overlap with $\hat{f}$ in more than $\frac{7}{10}n$ positions. If $F_{\hat{f}} = \{f\}$, which it is with probability at least 9/10, then Bob can infer every bit of $x$.

Since the Index$_N$ problem is known to have a one-way communication complexity of $\Omega(N)$, it must be that $|S(f)| = \Omega(\log |\mathcal{F}|)$. □

G Randomized lower bound, lemma 4.4

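Lemma G.1. For all $\varepsilon < 1/2$, $v > 32400\varepsilon \ln C$, and $n > 3v/\varepsilon$, there is a family $\mathcal{F}$ of sequences of length $n$, of size a constant multiple of $\exp(v/(2 \cdot 32400\varepsilon))$, such that:

1. no two sequences match, and

2. every sequence has variability at most $v$.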

Proof. We construct $\mathcal{F}$ so that each of the two items holds (separately) with probability at least 4/5. Let $m = 1/\varepsilon$. To construct one sequence in $\mathcal{F}$, first define $f(0) = m$ with probability 1/2, else $f(0) = m+3$. Then, for $t = 1 \ldots n$: define $f(t) = (2m+3) - f(t-1)$ with probability $p = v/6\varepsilon n$, else $f(t) = f(t-1)$. That is, switch from $m$ to $m+3$ (or vice-versa) with probability $p = v/6\varepsilon n$.

We first prove that the probability is at most 1/5 that any two sequences $f$ and $g$ match. We have that $P(f(0) = g(0)) = 1/2$. If at any point in time we have $f(t) = g(t)$, then $P(f(t+1) = g(t+1)) = a = 1 - 2p(1-p)$ and $P(f(t+1) \ne g(t+1)) = 1 - a = 2p(1-p)$. Similarly, if $f(t) \ne g(t)$, then $P(f(t+1) = g(t+1)) = 1 - a$ and $P(f(t+1) \ne g(t+1)) = a$.

The overlap between $f$ and $g$ is the number of times $t$ that $f(t) = g(t)$. We model this situation with a Markov chain $M$ with two states, $c$ for “same” (that is, $f = g$) and $d$ for “different” ($f \ne g$). Let $s_t$ be the state after $t$ steps, and let $p_t = (p_t(c), p_t(d))$ be the probabilities that $M$ is in state $c$ and $d$ after step $t$. The stationary distribution is $\pi = (1/2, 1/2)$, which also happens to be our initial distribution. We can model the overlap between $f$ and $g$ by defining a function $y(s_t) = 1$ if $s_t = c$ and $y(s_t) = 0$ otherwise; then $Y = \sum_t y(s_t)$ is the overlap between $f$ and $g$. The expected value $E(y(\pi))$ of $y$ evaluated on $\pi$ is 1/2.

The (1/8)-mixing time $T$ is defined as the smallest time $T$ such that $\frac{1}{2}\|M^T r_0 - \pi\|_1 \le 1/8$ over all initial distributions $r_0$. Let $r_0$ be any initial distribution and $r_t = M^t r_0$. If we define $\Delta_t = r_t(c) - \pi(c)$, then $\Delta_t = (2a-1)^t \Delta_0$. We can similarly bound $|r_t(d) - \pi(d)|$, so we can bound

$$T \le \frac{\ln(8)}{\ln(1/(2a-1))} \le \frac{3}{1-(2a-1)} = \frac{3}{4p(1-p)} \le \frac{3}{2p} = \frac{9\varepsilon n}{v}$$

since $1 - p \ge 1/2$ and since $1/\ln(1/x) \le 1/(1-x)$ for $x$ in $(0,1)$. With this information we can now apply a sledgehammer of a result by Chung, Lam, Liu, and Mitzenmacher [2]. Our fact G.2 is their theorem 3.1, specialized a bit to our situation:

Fact G.2. Let $M$ be an ergodic Markov chain with state space $S$. Let $T$ be its (1/8)-mixing time. Let $(s_1, \ldots, s_n)$ denote an $n$-step random walk on $M$ starting from its stationary distribution $\pi$. Let $y$ be a weight function such that $E(y(\pi)) = \mu$. Define the total weight of the walk by $Y = \sum_{t=1}^{n} y(s_t)$. Then there exists some universal constant $C$ such that $P(Y \ge (1+\delta)\mu n) \le C \exp(-\delta^2 \mu n / 72T)$ when $0 < \delta < 1$.

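Specifically, this means that $P(Y \ge \frac{6}{10}n) \le C \exp(-v/(25 \cdot 72 \cdot 9 \cdot \varepsilon))$. Since $v$ is large enough, we can also write $P \le \exp(-v/32400\varepsilon)$. If $|\mathcal{F}|$ is a sufficiently small constant multiple of $\exp(v/(2 \cdot 32400\varepsilon))$, then by the union bound, with probability at least 4/5, no pair of sequences $f, g$ matches.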

We also must prove that there are enough sequences with variability at most $v$. The change in variability due to a single switch from $m$ to $m+3$ (or vice-versa) is at most $3/m = 3\varepsilon$. For any sequence $f$, let $U_t = 1$ if $f$ switched at time $t$, else $U_t = 0$. The expected number of switches is $v/6\varepsilon$; using a standard Chernoff bound, $P(\sum_t U_t \ge 2v/6\varepsilon) \le \exp(-v/18\varepsilon) \le 1/10$. Suppose we sample $N$ sequences and $B$ of them have more than $2v/6\varepsilon$ switches. In expectation there are at most $E(B) \le \frac{1}{10}N$ that have too many switches. By Markov’s inequality, $P(B \ge N/2) \le 1/5$, so we can toss out the $\le N/2$ bad sequences. This gives us a final family whose size is still a constant multiple of $\exp(v/(2 \cdot 32400\varepsilon))$. □
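The mixing-time bound used in this proof can be checked numerically for the two-state chain. The sketch below iterates the chain from both point-mass starting distributions (the extreme cases for total variation distance) and reports the first time both are within 1/8 of uniform; the function names are hypothetical and the code is only a sanity check of the inequality $T \le 3/(2p)$.

```python
def step(dist, a):
    """One step of the two-state chain that stays put with probability a."""
    c, d = dist
    return (a * c + (1 - a) * d, (1 - a) * c + a * d)


def tv_to_uniform(dist):
    """Total variation distance (half the L1 distance) to (1/2, 1/2)."""
    return 0.5 * (abs(dist[0] - 0.5) + abs(dist[1] - 0.5))


def mixing_time(p, threshold=1 / 8, max_steps=10**6):
    """Smallest t at which every start is within `threshold` of uniform,
    for the chain with a = 1 - 2p(1 - p)."""
    a = 1 - 2 * p * (1 - p)
    starts = [(1.0, 0.0), (0.0, 1.0)]
    for t in range(max_steps + 1):
        if all(tv_to_uniform(r) <= threshold for r in starts):
            return t
        starts = [step(r, a) for r in starts]
    raise ValueError("chain did not mix within max_steps")


if __name__ == "__main__":
    for p in (0.01, 0.05):
        print(p, mixing_time(p), 3 / (2 * p))
```

For $p = 0.01$ the exact mixing time is 35 steps, well under the bound $3/(2p) = 150$.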


H Tracking item frequencies, section 5.1


Problem definition The problem of tracking item frequencies is only slightly different from the counting problem we’ve considered so far. In this problem there is a universe $U$ of items and we maintain a dataset $D(t)$ that changes over time. At each new timestep $n$, either some item $\ell$ from $U$ is added to $D$, or some item $\ell$ from $D$ is removed. This update is told to a single site $i$; that is, site $i(n)$ receives an update $f'_\ell(n) = \pm 1$.

The frequency $f_\ell(t)$ of item $\ell$ at time $t$ is the number of copies of $\ell$ that appear in $D(t)$. The first frequency moment $F_1(t)$ at time $t$ is the total number of items $|D(t)|$. The problem is to maintain estimates $\hat{f}_\ell(n)$ at the coordinator so that for all times $n$ and all items $\ell$ we have that $P(|\hat{f}_\ell(n) - f_\ell(n)| \le \varepsilon F_1(n))$ is large.

Since in this problem we are tracking each item frequency to $\varepsilon F_1(n)$, we use $F_1$-variability instead, defining $v'(t) = \min\{1, 1/F_1(t)\}$.


H.0.1 Item frequencies with low communication 

We first partition time into blocks as in section 3, using $f = F_1$. That is, at the end of each block we know the values $n$ and $F_1(n)$ deterministically, and also that either $r = 0$ holds or that $F_1(n_j)$ is within a factor of two of $F_1(n_{j-1})$.

For tracking during blocks we modify the deterministic algorithm so that each site $i$ holds counters $d_{i\ell}$ and $\delta_{i\ell}$ for every item $\ell$. It also holds counters $f_{i\ell}$ of the total number of copies of $\ell$ seen at site $i$ across all blocks.

At the end of each block, each site $i$ reports all $f_{i\ell} \ge \varepsilon 2^r/3$ (using the new value of $r$). If site $i$ reports counter $f_{i\ell}$ then it starts the next block with $d_{i\ell} = \delta_{i\ell} = 0$; otherwise, $d_{i\ell}$ is updated to $d_{i\ell} + \delta_{i\ell}$ and then $\delta_{i\ell}$ is reset to zero. Within a block $r \ge 1$, the condition is true when $|\delta_{i\ell}| \ge \varepsilon 2^r/3$.

The coordinator maintains estimates $\hat{f}_{i\ell}$ of $f_{i\ell}$ for each site $i$ and item $\ell$. Upon receiving an update $\delta_{i\ell}$ during a block the coordinator updates its estimate $\hat{f}_{i\ell} = \hat{f}_{i\ell} + \delta_{i\ell}$.

Estimation error The total error in the estimate $\hat{f}_{i\ell}(n)$ at any time $n$ is the error due to $d_{i\ell}$ plus the error due to $\delta_{i\ell}$. In both cases these quantities are bounded by $\varepsilon 2^r/3 \le \varepsilon F_1(n)/3$.

Communication The total communication for a block is the total communicated within and at the end of the block. Within a block, all $\delta_{i\ell}$ start at zero, and there are at most $2^r k$ updates, so the total number of messages sent is $3k/\varepsilon$. At the end of a block, $f_{i\ell} \ge \varepsilon 2^r/3$ is true for at most $12k/\varepsilon$ counters $f_{i\ell}$. Therefore the total number of messages is $O(\frac{k}{\varepsilon} v(n))$.

H.0.2 Item frequencies in small space+communication 

The algorithm so far uses $|U|$ counters per site, which is prohibitive in terms of space. In [3] Cormode and Muthukrishnan show that in order to track over a non-distributed update stream each $f_\ell(n)$ so that for all $\ell$ and all times $n$ we have $P(|\hat{f}_\ell(n) - f_\ell(n)| \le \varepsilon F_1(n)/3) \ge 8/9$, it suffices to randomly partition each item in $U$ into one of $27/\varepsilon$ classes using a pairwise-independent hash function $h$, and to estimate $f_\ell(n)$ as $f_{h(\ell)}(n)$. The $27/\varepsilon$ counters and the hash function $h$ together form their Count-Min Sketch [3].
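The single-row partitioning described here can be sketched in a few lines. This is an illustrative sketch rather than the construction of [3] itself: items are assumed to be non-negative integers, the hash is the standard $((ax + b) \bmod P) \bmod w$ family (which is only approximately pairwise independent after the final reduction), and the class name `SingleRowSketch` is hypothetical.

```python
import math
import random


class SingleRowSketch:
    """One row of ceil(27 / eps) counters indexed by a hash of the item."""

    PRIME = (1 << 61) - 1  # a Mersenne prime larger than any item id used

    def __init__(self, eps, seed=0):
        self.width = math.ceil(27 / eps)
        rng = random.Random(seed)
        self.a = rng.randrange(1, self.PRIME)
        self.b = rng.randrange(0, self.PRIME)
        self.counters = [0] * self.width

    def _bucket(self, item):
        return ((self.a * item + self.b) % self.PRIME) % self.width

    def update(self, item, delta):
        """Apply an insertion (+1) or deletion (-1) of `item`."""
        self.counters[self._bucket(item)] += delta

    def estimate(self, item):
        """Estimate the frequency of `item` as the count of its class."""
        return self.counters[self._bucket(item)]


if __name__ == "__main__":
    sketch = SingleRowSketch(eps=0.5)
    for item in [7, 7, 7, 12, 99]:
        sketch.update(item, +1)
    print(sketch.width, sketch.estimate(7))
```

Because every update is a linear operation on the counters, sketches held at different sites can be added together, which is what lets the coordinator combine per-site estimates.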

Similarly, Ganguly and Majumder adapt a data structure of Gasieniec and Muthukrishnan, which they call the CR-precis, to deterministically track each $f_\ell(n)$ to $\varepsilon F_1(n)/3$ error. This data structure uses multiple rows of counters, and estimates $f_\ell(n)$ as the average over rows $r$ of $f_{h(r,\ell)}(n)$. (Ganguly and Majumder actually take the minimum over the rows $r$, but the average works too and yields a linear sketch.)

In either case, we can first reduce our set of items $\ell$ to a small number of counters $c$, and instead of tracking $f_{i\ell}$ we track $f_{ic}$ for each counter $c$. The coordinator can then linearly combine its estimates $\hat{f}_{ic}$ to obtain estimates $\hat{f}_\ell$ for each item $\ell$. This introduces another $\varepsilon F_1(n)/3$ error, yielding algorithms that guarantee

• P{\fie{n) — fu{n)\ < eFi{n)) = 1 in logn) bits of space -|- communication, and 

• Pilfuin-) —fu{n)\ < eFi{n)) > 8/9 in 0{k\og\U\ + jv{n)\ogn) bits of space -I- communication. 

H.0.3 Remarks 

We obtain a randomized communication bound of 0{^v{n)) messages, but it might be possible to do better. 
In [8] Huang et al. both develop a randomized counting algorithm ($O(\frac{\sqrt{k}}{\varepsilon} \log n)$ messages) and also extend it to the problem of tracking item frequencies to get the same communication bound. Unfortunately, their algorithm appears to require the total variance in their estimate at any time $t \le n$ to be bounded by a constant factor of the variance at time $n$. This is only guaranteed to be true when item deletions are not permitted (and $F_1$ grows monotonically). We avoid this problem in section 3.4 for tracking $f = F_1$ by deterministically updating $F_1$ at the end of each block. For this problem, though, deterministically updating all of the large $f_{i\ell}$ at the end of each block could incur $O(1/\varepsilon)$ messages. Whether it is also possible to
probabilistically track item frequencies over general update streams in 0 {^v{n)) messages remains open. 


I Aggregate functions with one site, section 5.2


The single-site algorithm of section 5.2 is: whenever $|f - \hat{f}| > \varepsilon |f|$, send $f$ to the coordinator.


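Proof. If $f(n) = 0$ then $v'(n) = 1$. Also, if $f(n)$ changes sign from $f(n-1)$, then $v'(n) = 1$. So consider intervals over which $f(n)$ is nonzero and doesn’t change sign. Over such an interval, let $\Phi(n) = \frac{|f(n) - \hat{f}(n)|}{|f(n)|}$.

If at time $n$ we update $\hat{f}$ then $\Phi(n) = 0$. Otherwise,

$$\begin{aligned}
\Phi(n) &= \frac{|f(n-1) - \hat{f}(n-1) + f'(n)|}{|f(n)|} \le \frac{|f(n-1) - \hat{f}(n-1)|}{|f(n)|} + \frac{|f'(n)|}{|f(n)|} \\
&= \Phi(n-1)\frac{|f(n-1)|}{|f(n)|} + \frac{|f'(n)|}{|f(n)|} \le \Phi(n-1)\frac{|f(n)| + |f'(n)|}{|f(n)|} + \frac{|f'(n)|}{|f(n)|} \\
&= \Phi(n-1) + \frac{(1 + \Phi(n-1))\,|f'(n)|}{|f(n)|}.
\end{aligned}$$

Since $\Phi(n-1) \le \varepsilon$ we have $|\Phi'(n)| \le (1+\varepsilon)\left|\frac{f'(n)}{f(n)}\right|$. We only send a message each time that $\Phi$ would be more than $\varepsilon$, so the total number of messages sent is at most the total increase in $\Phi$, which is $\sum_{t=1}^{n} \min\{1, \left|\frac{f'(t)}{f(t)}\right|\}$. □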
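The single-site rule is simple enough to simulate directly. The sketch below is an illustration under stated assumptions: the stream is a list of integer updates, the estimate starts at 0, and a step counts as 1 toward the variability whenever $f(t) = 0$. The function name `track_single_site` is hypothetical. It returns the message count alongside the variability so the two can be compared on concrete streams.

```python
def track_single_site(updates, eps):
    """Simulate: whenever |f - f_hat| > eps * |f|, send f to the coordinator.

    Returns (messages, variability, history), where history holds the
    (f, f_hat) pair after each update has been processed.
    """
    f = 0
    f_hat = 0
    messages = 0
    variability = 0.0
    history = []
    for delta in updates:
        f += delta
        variability += 1.0 if f == 0 else min(1.0, abs(delta) / abs(f))
        if abs(f - f_hat) > eps * abs(f):
            f_hat = f          # the site sends f; the coordinator stores it
            messages += 1
        history.append((f, f_hat))
    return messages, variability, history


if __name__ == "__main__":
    stream = [1] * 100 + [-1] * 60 + [1] * 30
    messages, variability, _ = track_single_site(stream, eps=0.1)
    print(messages, round(variability, 3))
```

On every stream the coordinator's estimate stays within $\varepsilon|f|$ after each update, and the message count grows with the variability rather than with the stream length.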